Abstract

We study the problem of estimating rigid motion from a sequence of monocular
perspective images obtained by navigating around an object while fixating a particular
feature point. The motivation comes from the mechanics of the human eye, which either
smoothly pursues some fixation point in the scene, or “saccades” between different
fixation points. In particular, we are interested in understanding whether fixation
helps the process of estimating motion, in the sense that it makes it more robust,
better conditioned or simpler to solve.

We cast the problem in the framework of “dynamic epipolar geometry”, and propose
an implicit dynamical model for recursively estimating motion from fixation. This
allows us to compare directly the quality of the estimates of motion obtained by imposing
the fixation constraint, or by assuming a general rigid motion, simply by changing
the geometry of the parameter space while maintaining the same structure of the
recursive estimator. We also present a closed-form static solution from two views, and a
recursive estimator of the absolute attitude between the viewer and the scene.

One important issue is how the estimates degrade in the presence of disturbances
in the tracking procedure. We describe a simple fixation control that converges
exponentially, complemented by an image shift-registration for achieving sub-pixel
accuracy, and assess how small deviations from perfect tracking affect the estimates of
motion.


When  a  rigid  object  is  moving  in  front  of  us  (or  we  are  moving  relative  to  it),  the  information 
coming from the time-varying projection of the object onto one of our eyes suffices to estimate
its  motion,  even  when  its  shape  is  unknown. 

*Research  sponsored  by  NSF  NYI  Award,  NSF  ERC  in  Neuromorphic  Systems  Engineering  at  Caltech, 
ONR grant N00014-93-1-0990. This work is registered as CDS technical report n. CIT-CDS 95-006, February
1995. 
In order to observe the motion of the object while holding our head still and one eye
closed,  we  can  choose  either  to  track  it  (or  a  particular  feature  on  its  surface)  by  moving  the 
eye,  or  to  hold  the  eye  still  (by  fixating  some  feature  in  the  still  background),  and  let  the 
object  cross  our  field  of  view.  When  it  is  us  moving  in  the  environment  (or  “object”),  our 
eye  constantly  “holds”  on  some  particular  feature  in  the  scene  (smooth  pursuit)  or  “jumps” 
between  different  features  (saccadic  motion). 

From  a  geometric  point  of  view  there  is  no  difference  between  the  observer  moving  or  the 
object  moving,  and  the  problem  of  estimating  rigid  motion  from  a  sequence  of  projections  is 
by  now  fairly  well  understood.  In  this  paper  we  explore  how  the  fixation  constraint  modifies 
the  geometry  of  the  problem,  and  whether  it  facilitates  the  task. 

This  problem  has  been  in  part  addressed  before  in  the  literature  of  computational  vision. 
In  [6,  5],  the  fixation  constraint  is  exploited  for  recovering  the  Focus  of  Expansion  (FOE) 
and  the  time-to-collision  using  normal  optical  flow,  and  then  computing  the  full  ego-motion, 
including  the  portion  due  to  the  fixating  motion.  In  [12],  a  pixel  shift  in  the  image  is  used 
in  order  to  derive  a  constraint  equation  which  is  solved  using  static  optimization  in  order 
to  recover  ego-motion  parameters,  similarly  to  what  is  done  in  [3,  10].  However,  nowhere 
in  the  literature  is  the  estimation  of  motion,  performed  by  imposing  the  fixation  constraint, 
directly  compared  with  the  estimation  of  a  general  rigid  motion,  due  to  the  lack  of  a  common 
framework.  More  seriously,  most  of  the  algorithms  assume  that  perfect  tracking  of  the 
fixation  point  has  been  performed,  and  it  is  not  assessed  how  they  degrade  in  the  presence 
of  inevitable  tracking  errors. 

In  this  paper  we  study  the  motion  estimation  problem  in  the  framework  of  dynamic 
epipolar  geometry,  and  assess  how  such  geometry  is  modified  under  the  fixation  assumption. 
Since  dynamic  motion  estimation  schemes  have  been  proposed  in  the  framework  of  epipolar 
geometry  [11],  we  modify  them  in  order  to  embed  the  fixation  assumption.  As  a  result, 
we  can  directly  compare  the  estimates  obtained  by  enforcing  the  fixation  constraint  with 
the  estimates  obtained  by  assuming  general  rigid  motion.  We  also  assess  analytically  how 
(small) perturbations of the fixation constraint affect the quality of the estimates, and we
perform  simulation  experiments  in  order  to  probe  the  boundaries  of  validity  of  the  fixation 
model. 


1.1  Scenario 

We will consider a system with a camera mounted on a two-degree-of-freedom actuated
joint (the eye) standing on a platform (the head) which moves freely, with 6 degrees of
freedom, in the environment, as in figure 1. The overall system is composed of two parts:
an inner control loop that actuates the eye so as to maintain a given feature in the center
of the image-plane, or to saccade to a different fixation point supplied by a higher-level
decision system; and an estimator that reconstructs the relative motion between the eye
and the object which is due to the motion of the head within the environment. These
estimates can then be used to elaborate control actions for different tasks, such as
obstacle avoidance, “optimal” estimation of structure, target pursuit etc.

The  overall  functioning  of  the  scheme  can  be  summarized  as  follows  (see  figure  1): 

1.  Select  features. 
Figure 1: Overall setup of motion from fixation: an inner tracking loop controls the two
degrees of freedom of the eye so as to maintain a given feature in the center of the image.
The images are then fed into the motion estimation algorithm that recursively estimates the
motion of the head within the environment. The estimates can possibly be fed back to the
head in order to accomplish different control tasks such as navigation, inspection, docking
etc. (outer dashed loop).

2.  Select  a  target  or  fixation  point.  This  could  be  the  feature  closest  to  the  center  of  the 
image,  or  the  best-conditioned  feature,  or  the  focus  of  expansion,  or  the  singularity  in 
the  motion  field  or  any  other  location  assigned  from  a  higher-level  system. 

3. Control the gaze of the eye to the fixation point. Simple control strategies can be
implemented, such as a one-step deadbeat, or control on the sphere with exponential
convergence. The kinematics and geometry of the eye mechanism must be included in
the model (it will be a change of coordinates in the state-space sphere); the dynamics
can be neglected in a first approximation.

4.  Fine-tune  fixation  by  shifting  the  origin  of  the  image-plane. 

5.  Track  features  between  successive  time  instants.  This  process  (the  correspondence 
problem)  is  greatly  facilitated  by  two  facts.  First,  since  we  fixate  one  point  in  the 
visible  object,  features  only  move  little  in  the  image,  and  always  remain  within  the 
field of view. Second, knowledge of the motion of the camera from the actuators helps
predict the position of the features at successive frames.




6.  Go  to  3.  (Inner,  fast  tracking  loop). 

7. Estimate relative motion between the object and the viewer. Either velocity or absolute
orientation can be estimated. Check the quality of tracking.

8.  Possibly  take  control  action  on  the  head  in  order  to  achieve  specified  tasks  (outer  loop). 

We will only briefly describe the realization of the inner control loop (the “tracking” or
“fixation” loop), which consists of a control system defined on a two-sphere, with measurements
in the real projective plane (section 1.2). This problem is well understood and extensive
literature is available on the topic (see [4] and references therein). The rest of the paper
assumes that tracking has been performed within some level of accuracy and analyzes the
problem of estimating the remaining degrees of freedom. In section 2 we review the setup
of epipolar geometry and show how it is modified by the fixation assumption. In section 3
we show how the epipolar representation can be used in order to formulate dynamic
(recursive) estimators of motion. The fixation assumption modifies the parameter space, but
not the structure of the estimator, which makes it possible to compare motion estimators
embedding the fixation constraint with estimators of general rigid motions. We present
both a closed-form solution from two views and a recursive solution based upon the epipolar
representation. In section 5 we describe a model for estimating absolute attitude under the
fixation constraint.

While  it  is  evident  that  fixation  reduces  the  number  of  degrees  of  freedom,  and  therefore 
the  estimator  following  the  tracking  loop  will  operate  on  a  smaller-dimensional  space  and 
hence  be  more  constrained,  it  is  not  trivial  to  assess  how  possible  imprecisions  in  the  tracking 
stage  propagate  onto  the  estimation  stage.  In  section  4  we  assess  the  sensitivity  of  the 
estimates with respect to the fixation constraint, and define a measure of “goodness of
tracking”  that  can  be  performed  during  the  estimation  phase. 

In  section  6  we  substantiate  our  analysis  with  simulation  experiments  on  noisy  synthetic 
image  sequences. 

1.2  Fixation  control 

The task of the inner tracking loop is that of keeping a given point in the center of the image
plane. Equivalently, we can enforce that a given direction (projection ray) in $\mathbb{R}^3$ coincides
with the optical axis (see figure 2). In order to do so, we can act on two motors that drive
the joint on top of which the camera is mounted. If we call $[\theta\ \phi]^T$ the angles at the joint
which describe the local coordinates of the state $s$ of the eye on the sphere, and $u_1$ and $u_2$
the torques applied to the motors, then the geometry, kinematics and dynamics of the eye
can be described as a nonlinear dynamical system of the form:

$$\dot{s} = f(s, u), \qquad s \in S^2. \qquad (1)$$

If we call $x_0$ the spherical coordinates of the target point in the reference centered in the
optical center of the camera, with the Z-axis along the optical axis, then the motion of the
camera $s(t)$ induces a vector field of the form

$$\dot{x}_0 = g(x_0, s), \qquad x_0 \in S^2. \qquad (2)$$
However, we cannot measure directly the spherical coordinates of the target point, since it
is projected on a flat image-plane, rather than on a spherical retina (figure 2). In fact, the
actual measure is a local diffeomorphism

$$\pi : S^2 \rightarrow \mathbb{R}P^2, \qquad x_0 \mapsto y_0. \qquad (3)$$

Our overall dynamic model can be therefore summarized as

$$\begin{cases} \dot{s} = f(s, u) & s \in S^2 \\ \dot{x}_0 = g(x_0, s) & x_0 \in S^2 \\ y_0 = \pi(x_0) + n_0 & y_0 \in \mathbb{R}P^2 \end{cases} \qquad (4)$$

where $n_0$ is a noise term due to the uncertainty in the tracking procedure. The goal of the
inner tracking module can then be expressed as follows:

take the control action $u(t)$ such that $y_0(t) \rightarrow [0\ 0\ 1] \in \mathbb{R}P^2$ exponentially as
$t \rightarrow \infty$.


When we neglect the dynamics of the eye, and we assume that we are able to act on the
velocity of the joints through our actuators, we can simplify our model into one of the form

$$\begin{cases} \dot{x}_0 = u & x_0 \in S^2 \\ y_0 = h(x_0) + n_0 & y_0 \in \mathbb{R}P^2 \end{cases} \qquad (5)$$

which we can write in local coordinates, provided that $y_0$ is close enough to $h(x_0)$, as

$$\begin{cases} \dot{x}_0 = u & x_0 \in \mathbb{R}^2 \\ y_0 = h(x_0) + n_0 & y_0 \in \mathbb{R}^2 \end{cases} \qquad (6)$$

where $h$ comprises a change of coordinates in the sphere and the perspective projection.

From the above expression it is immediate to formulate a proportional control law with
exponential convergence to the target fixation point $y_0$, either in the workspace,

$$u_w(x, y_0) = k_p \left( h^{-1}(y_0) - x \right), \qquad (7)$$

or in the output space, represented for simplicity as the two-sphere,

$$u_o(x, y_0) = J_h(x)\, k_p\, v_G(x, y_0) \qquad (8)$$

where $k_p$ is the proportional constant, $J_h$ is the Jacobian of $h$:

$$J_h(x) = \frac{\partial h}{\partial x}(x) \qquad (9)$$

and $v_G$ is the geodesic versor

$$v_G(x, y_0) = \frac{(h(x) \wedge y_0) \wedge h(x)}{d} \qquad (10)$$

with $d = \arccos(\langle h(x), y_0 \rangle)$ the distance between the output and the target along the
geodesic [4].

Exponential convergence is required as a means of counteracting noise. In fact, if the control
is fast, it can damp disturbances at a rate faster than they arrive, which helps the system
not to diverge in the presence of noise and disturbances. The above controls can be easily
shown to generate exponential convergence to the desired goal [4].
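
As a rough illustration of the workspace law (7), the following minimal sketch integrates $\dot{x} = u$ with $u = k_p (x^* - x)$ in local coordinates. It assumes that $h$ is the identity, so that $h^{-1}(y_0)$ is simply the target position $x^*$; the gain, the time step and the function name are hypothetical choices made for the example, not values taken from the paper.

```python
import math

def simulate_fixation(x0, target, kp=5.0, dt=0.01, steps=200):
    """Forward-Euler integration of x_dot = kp * (target - x).

    Assumption: h is the identity, so the workspace law (7) reduces to
    u = kp * (target - x). Returns the error norm before each step.
    """
    x = list(x0)
    errors = []
    for _ in range(steps):
        err = [t - xi for t, xi in zip(target, x)]
        errors.append(math.hypot(err[0], err[1]))
        # One Euler step of the closed loop: x <- x + dt * u
        x = [xi + dt * kp * e for xi, e in zip(x, err)]
    return errors

errors = simulate_fixation(x0=(0.3, -0.2), target=(0.0, 0.0))
# Ratio between consecutive error norms (constant for a linear closed loop)
ratios = [errors[k + 1] / errors[k] for k in range(len(errors) - 1)]
print(ratios[0], errors[-1])
```

With these values every step multiplies the error by $1 - k_p\,dt$, which is the discrete-time counterpart of the exponential decay $e^{-k_p t}$ of the continuous law.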
1.3 Tracking and shift registration

The  purpose  of  the  eye  motion  control  is  to  keep  a  prescribed  feature  at  the  origin  of  the 
image  plane  using  two  degrees  of  freedom  of  the  spherical  joint  of  the  eye.  In  principle, 
tracking  of  the  target  feature  could  be  accomplished  locally  by  shifting  the  origin  of  the 
image-plane  at  each  step,  provided  that  the  feature  remains  within  the  field  of  view  (see 
figure  2).  In  general,  a  combination  of  the  two  techniques  is  to  be  employed.  The  eye  is 
rotated  in  order  to  maintain  the  target  feature  as  close  as  possible  to  the  center  of  the  image, 
then  the  image  plane  is  shifted,  with  a  purely  “software”  operation,  in  order  to  translate  the 
origin  of  the  image-plane  on  the  target  feature.  Provided  that  the  feature  tracking  scheme 
achieves  sub-pixel  accuracy  [2],  the  shift-registration  allows  us  to  perform  the  tracking  within 
one  pixel  accuracy  on  the  image-plane. 


Figure 2: Tracking amounts to controlling the camera so as to bring one specified feature-point
to the origin of the image plane. The same task can be accomplished locally by shifting the
image-plane, a purely software operation. The two operations are equivalent locally, to the
extent that the target feature does not exit the field of view.


2  Epipolar  geometry  under  fixation 

In  the  present  section  we  analyze  the  functioning  of  the  second  stage  of  the  scheme  depicted 
in  figure  1,  which  consists  of  estimating  the  relative  motion  between  the  viewer  and  the 
object being fixated. Since one point of the object is still in the image plane, the object is
free only to rotate about this point, and to translate along the fixation line. Therefore there
are overall 4 degrees of freedom left from the fixation loop.

We start off with studying how the well-known setup of the epipolar geometry is
transformed under the fixation conditions.


Figure  3:  Imaging  geometry.  The  viewer-reference  is  centered  in  the  center  of  projection, 
with the Z-axis pointing along the optical axis. The object reference frame is centered in the
fixation  point.  Under  the  fixation  conditions  the  object  can  only  rotate  about  the  fixation 
point  and  translate  along  the  fixation  axis. 


2.1  Notation 

We call $X = [X\ Y\ Z]^T \in \mathbb{R}^3$ the coordinates of a generic point $P$ with respect to an
orthonormal reference frame centered in the center of projection, with $Z$ along the optical
axis and $X$, $Y$ parallel to the image plane and arranged so as to form a right-handed frame (see
figure 3). The relative attitude between the camera and the object (or scene) is described
by a rigid motion $g \in SE(3)$:

$$\begin{cases} P^i(t) = {}^{t}g_o\, P^i_o \\ P^i(t+1) = {}^{t+1}g_o\, P^i_o \end{cases} \;\Rightarrow\; P^i(t+1) = {}^{t+1}g_o\; {}^{t}g_o^{-1}\, P^i(t) \qquad (11)$$

where ${}^{\tau}g_o \in SE(3)$ is the change of coordinates between the viewer reference frame at time $\tau$
and the object coordinate frame centered in the fixation point $P_0(t) = [0\ 0\ d(t)]^T$. Since we
are interested in the displacement relative to the moving frame (ego-motion), we can assume
that the object reference is aligned with the viewer reference at time $t$, so that we can write
the relative orientation between time $t$ and $t+1$ in coordinates as


$$X^i(t+1) = R(t) \left( X^i(t) - \begin{bmatrix} 0 \\ 0 \\ d(t) \end{bmatrix} \right) + \begin{bmatrix} 0 \\ 0 \\ d(t+1) \end{bmatrix} \qquad (12)$$

which we will write as

$$X^i(t+1) = R(t)\, X^i(t) + d(t)\, T(R, v) \qquad (13)$$

where

$$T(R, v) = \begin{bmatrix} -R_{13} \\ -R_{23} \\ -R_{33} + v \end{bmatrix} \qquad (14)$$

and

$$v = \frac{d(t+1)}{d(t)} \neq 0 \qquad (15)$$

is the relative velocity along the fixation axis. The matrix $R \in SO(3)$ is an orthonormal
rotation matrix that describes the change of coordinates between the viewer’s reference at
time $t$ and that at time $t+1$ relative to the object. $T \in \mathbb{R}^3$ describes the translation of the
origin of the viewer’s reference frame.
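
The equivalence between (12) and (13) can be checked numerically. The sketch below is only an illustration under assumed values: the rotation, the depths $d(t)$, $d(t+1)$ and the test point are arbitrary, and the helper names are hypothetical.

```python
import math

def rot_x(a):
    c, s = math.cos(a), math.sin(a)
    return [[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]]

def rot_y(a):
    c, s = math.cos(a), math.sin(a)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def matvec(A, x):
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def fixation_translation(R, v):
    """T(R, v) of eq. (14): minus the third column of R, plus v along Z."""
    return [-R[0][2], -R[1][2], -R[2][2] + v]

def move_direct(R, X, d_t, d_next):
    """Eq. (12): rotate about the fixation point, then restore it at depth d(t+1)."""
    Y = matvec(R, [X[0], X[1], X[2] - d_t])
    return [Y[0], Y[1], Y[2] + d_next]

def move_compact(R, X, d_t, d_next):
    """Eq. (13): X(t+1) = R X(t) + d(t) T(R, v), with v = d(t+1)/d(t)."""
    T = fixation_translation(R, d_next / d_t)
    return [y + d_t * t for y, t in zip(matvec(R, X), T)]

R = matmul(rot_x(0.05), rot_y(-0.1))       # arbitrary small rotation
X = [0.3, -0.4, 2.5]                       # arbitrary point in front of the camera
direct = move_direct(R, X, 2.0, 2.2)
compact = move_compact(R, X, 2.0, 2.2)
# The fixation point itself must stay on the optical axis
fixation_point = move_compact(R, [0.0, 0.0, 2.0], 2.0, 2.2)
print(direct, compact, fixation_point)
```

The last line also shows why the model encodes fixation: whatever the rotation, the point $[0\ 0\ d(t)]^T$ is mapped onto $[0\ 0\ d(t+1)]^T$.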

What we are able to measure is the perspective projection $\pi$ of the point features
onto the image plane, which for simplicity we represent as the real projective plane. The
projection map $\pi$ associates to each $P \neq 0$ its projective coordinates as an element of $\mathbb{R}P^2$:

$$\pi : \mathbb{R}^3 - \{0\} \rightarrow \mathbb{R}P^2, \qquad X \mapsto x = \left[ \tfrac{X}{Z}\ \ \tfrac{Y}{Z}\ \ 1 \right]^T. \qquad (16)$$

We usually measure $x$ up to some error $n$, which is well modeled as a white, zero-mean and
normally distributed process with covariance $R_n$:

$$y = x + n, \qquad n \in \mathcal{N}(0, R_n).$$


Due  to  the  fixation  constraint,  the  camera  is  only  allowed  to  translate  along  the  fixation 
axis,  rotate  about  the  fixation  axis  (cyclorotation)  and  move  on  a  sphere  centered  in  the 
fixation  point  with  radius  equal  to  the  distance  from  the  fixation  point  to  the  optical  center. 
Therefore  there  are  4  degrees  of  freedom  in  the  velocity.  These  can  also  be  easily  seen  from 
the  object  reference  frame:  the  object  reference  is  free  to  rotate  about  the  fixation  point  (3 
degrees  of  freedom)  but  can  only  translate  along  the  fixation  axis  (1  degree  of  freedom). 

In  eq.  (13),  these  4  degrees  of  freedom  are  encoded  into  R(t)  (3  DOF)  and  v(t)  (1 
DOF).  However,  note  that  also  the  distance  from  the  fixation  point  d(t)  enters  the  model. 
The  epipolar  constraint,  which  will  be  derived  in  the  next  subsection,  involves  only  relative 
orientation  and  measured  projections,  while  it  gets  rid  of  the  3-D  structure  and  of  the 
absolute  distance  d. 


Figure  4:  Coplanarity  constraint:  the  coordinates  of  each  point  in  the  reference  of  the 
viewer  at  time  t,  the  coordinates  of  the  same  point  at  time  t+1  and  the  translation  vector 
are  coplanar. 

2.2  Coplanarity  constraint 

The well-known coplanarity constraint (or “epipolar constraint”, or “essential constraint”)
of Longuet-Higgins [8] imposes that the vectors $T(R(t), v(t))$, $X^i(t+1)$ and $X^i(t)$ be coplanar
for all $t$ and for all points $P^i$ (figure 4). The triple product of the above vectors is therefore
zero; if we multiply both sides of (13) by $\alpha X^i(t+1)^T (T\wedge)$, where $\alpha \in \mathbb{R} - \{0\}$, we get

$$0 = X^i(t+1)^T (T\wedge)\, R(t)\, X^i(t) \qquad (17)$$

which we will write as

$$X^i(t+1)^T Q(t)\, X^i(t) = 0 \qquad (18)$$

with

$$Q(t) = Q(R(t), v(t)) = \left( T(R(t), v(t)) \wedge \right) R(t). \qquad (19)$$

We will use the notation $Q(t)$ when emphasizing the time-dependence, while we will use
$Q(R, v)$ when stressing the dependence of $Q$ on the 3 rotation parameters contained in
$R$ and on the relative velocity along the fixation axis $v$. Note that $Q$ is an element of a
4-dimensional differentiable manifold which is embedded in $\mathbb{R}^9$, since $Q$ is realized as a $3 \times 3$
matrix.

Since the coordinates of each point $X^i(t)$ and their projective coordinates $x^i(t)$ span the
same direction in $\mathbb{R}^3$, the constraint (18) holds for $x^i$ in place of $X^i$ (just divide eq. (18) by
the depths $Z^i(t+1) Z^i(t)$):

$$x^i(t+1)^T Q(t)\, x^i(t) = 0 \qquad \forall t,\ \forall i. \qquad (20)$$
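
A small numerical check of the constraint (20) may help fix ideas. This is a sketch with made-up values (rotation, depth, relative velocity and points are arbitrary assumptions), using noise-free projections; the helper names are hypothetical.

```python
import math

def rot_x(a):
    c, s = math.cos(a), math.sin(a)
    return [[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]]

def rot_y(a):
    c, s = math.cos(a), math.sin(a)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def matvec(A, x):
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def hat(t):
    """Skew-symmetric matrix (t wedge) such that hat(t) x = t cross x."""
    return [[0.0, -t[2], t[1]], [t[2], 0.0, -t[0]], [-t[1], t[0], 0.0]]

def project(X):
    """Perspective projection of eq. (16)."""
    return [X[0] / X[2], X[1] / X[2], 1.0]

R = matmul(rot_x(0.05), rot_y(-0.1))
d_t, v = 2.0, 1.1
T = [-R[0][2], -R[1][2], -R[2][2] + v]          # eq. (14)
Q = matmul(hat(T), R)                           # eq. (19)

points = [[0.3, -0.4, 2.5], [-0.6, 0.2, 1.8], [0.1, 0.5, 3.0]]
residuals = []
for X in points:
    X_next = [y + d_t * t for y, t in zip(matvec(R, X), T)]   # eq. (13)
    x_now, x_next = project(X), project(X_next)
    Qx = matvec(Q, x_now)
    residuals.append(sum(a * b for a, b in zip(x_next, Qx)))  # x(t+1)^T Q x(t)
print(residuals)
```

The residuals vanish up to rounding without any knowledge of the depths of the points or of $d(t)$, which is the property the estimators of section 3 rely on.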
2.3 Structure of the essential manifold


For a generic $T \in \mathbb{R}^3$ and a rotation matrix $R$, the matrix $Q = (T\wedge)R$ belongs to the
so-called “essential manifold”

$$E = \{ SR \mid S \in so(3),\ R \in SO(3) \}, \qquad (21)$$

which can be characterized as the tangent bundle to the rotation group $TSO(3)$ [11]. Under
the fixation constraint, $T$ has a special structure which restricts $Q$ to a submanifold of the
essential manifold. In this section we study the geometry of such a submanifold induced by
the fixation constraint. We have already seen that the dimension of the space reduces from
6 down to 4, since two degrees of freedom are used in order to keep the projection of the
fixation point still in the image plane.

After some simple algebra, it is easy to see that

$$Q(R, v) = R S^T + v S R \qquad (22)$$

where

$$S = \begin{bmatrix} 0 & -\alpha & 0 \\ \alpha & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix} \qquad (23)$$

and $\alpha$ is an unknown scaling factor due to the homogeneous nature of the coplanarity
constraint. If we restrict the essential matrices $Q \in E$ to have unit norm (as in the definition
of the “normalized essential manifold” [11]), then $\alpha$ is fixed to be $\alpha = 1/\|T\|$. Note that this
arbitrary scaling affects neither the relative velocity $v$ (which is already a scaled
parameter) nor the rotation matrix $R$. We will see in section 2.4 that $\alpha = 1/\|T\|$ is a necessary
choice in order to avoid singularities in the representation. Under the fixation constraint,
both the essential matrix $Q$ and its normalized version $Q/\|T\|$ belong to a four-dimensional
submanifold of the essential manifold $E$. The essential matrix is therefore defined, under
the fixation constraint, by the Sylvester equation (22), with strongly structured unknowns
$R \in SO(3)$ and $v \in \mathbb{R}$. Other equivalent expressions can be derived as follows, assuming
$\alpha = 1$:


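$$Q = \left( R S^T R^T + v S \right) R \qquad (24)$$

$$Q = \left( -R \begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix} + v \begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix} \right) \wedge R \qquad (25)$$

$$Q = \left[\, -R_{\cdot 2} \mid R_{\cdot 1} \mid 0 \,\right] + v \begin{bmatrix} -R_{2\cdot} \\ R_{1\cdot} \\ 0 \end{bmatrix} \qquad (26)$$

$$Q = \begin{bmatrix} -R_{12} - vR_{21} & R_{11} - vR_{22} & -vR_{23} \\ -R_{22} + vR_{11} & R_{21} + vR_{12} & vR_{13} \\ -R_{32} & R_{31} & 0 \end{bmatrix} \qquad (27)$$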
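
The agreement between the Sylvester form (22) and the cross-product form (19) can be verified on an example. The sketch below assumes $\alpha = 1$ and an arbitrary rotation and relative velocity; function names are hypothetical.

```python
import math

S = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 0.0]]   # eq. (23), alpha = 1

def rot_x(a):
    c, s = math.cos(a), math.sin(a)
    return [[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]]

def rot_y(a):
    c, s = math.cos(a), math.sin(a)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def transpose(A):
    return [[A[j][i] for j in range(3)] for i in range(3)]

def hat(t):
    return [[0.0, -t[2], t[1]], [t[2], 0.0, -t[0]], [-t[1], t[0], 0.0]]

def q_sylvester(R, v):
    """Eq. (22): Q = R S^T + v S R."""
    RSt, SR = matmul(R, transpose(S)), matmul(S, R)
    return [[RSt[i][j] + v * SR[i][j] for j in range(3)] for i in range(3)]

def q_cross(R, v):
    """Eq. (19): Q = (T wedge) R with T(R, v) from eq. (14)."""
    T = [-R[0][2], -R[1][2], -R[2][2] + v]
    return matmul(hat(T), R)

R = matmul(rot_x(0.05), rot_y(-0.1))
v = 1.1
Q_a, Q_b = q_sylvester(R, v), q_cross(R, v)
max_diff = max(abs(Q_a[i][j] - Q_b[i][j]) for i in range(3) for j in range(3))
print(max_diff, Q_a[2][2])
```

In the Sylvester form the entry $Q_{33}$ is zero by construction, since the last row and the last column of $S$ vanish.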


Another useful way of writing the epipolar constraint can be derived as follows. Since the
constraints (20) are linear in the components of the essential matrix $Q$, we can reorder them
as

$$\chi(t)\, Q = 0 \qquad (28)$$

where $\chi(t)$ is an $N \times 9$ matrix which depends on the measurements $x^i(t)$, $x^i(t+1)$, whose
generic row can be written as

$$\chi_{i\cdot} = \left[\, x_1^i(t+1)x_1^i(t) \;\; x_1^i(t+1)x_2^i(t) \;\; x_1^i(t+1) \;\; x_2^i(t+1)x_1^i(t) \;\; x_2^i(t+1)x_2^i(t) \;\; x_2^i(t+1) \;\; x_1^i(t) \;\; x_2^i(t) \;\; 1 \,\right] \qquad (29)$$

with $x_1^i$, $x_2^i$ the first two components of $x^i$. $Q$ is now interpreted as a 9-dimensional
column vector obtained by stacking the rows of $Q$ one on top of each other. It is easy to
verify that the above can be written as follows:


$$\chi(t)\, \bar{S}(v)\, R = 0 \qquad (30)$$

where

$$\bar{S}(v) = \begin{bmatrix} S & -vI & 0 \\ vI & S & 0 \\ 0 & 0 & S \end{bmatrix} \qquad (31)$$

is a skew-symmetric, $9 \times 9$ matrix with rank 8 which depends only upon the translational
velocity $v$. $I$ is the 3-dimensional identity matrix and $R$ is the usual rotation matrix, now
interpreted as a nine-dimensional column vector obtained by stacking the rows of $R$ on top
of each other. We will not make a distinction between $3 \times 3$ matrices and 9-dimensional
column vectors, whenever it is clear from the context which representation is employed.
Since both the last row and the last column of $\bar{S}$ are identically zero, we can delete them
along with the last column of $\chi$ and the last element of $R$, which is now interpreted as an
8-dimensional column vector.
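
The stacked form (30)-(31) can be illustrated as follows. This is a sketch under the assumption $\alpha = 1$ and row-wise stacking as described above; `s_bar`, `vec_rows` and `chi_row` are hypothetical helper names, and the numerical values are arbitrary.

```python
import math

S = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 0.0]]   # eq. (23), alpha = 1

def rot_x(a):
    c, s = math.cos(a), math.sin(a)
    return [[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]]

def rot_y(a):
    c, s = math.cos(a), math.sin(a)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def vec_rows(A):
    """Stack the rows of a 3x3 matrix into a 9-vector."""
    return [A[i][j] for i in range(3) for j in range(3)]

def s_bar(v):
    """9x9 block matrix of eq. (31): [[S, -vI, 0], [vI, S, 0], [0, 0, S]]."""
    M = [[0.0] * 9 for _ in range(9)]
    for b in range(3):                      # diagonal blocks equal to S
        for i in range(3):
            for j in range(3):
                M[3 * b + i][3 * b + j] = S[i][j]
    for i in range(3):
        M[i][3 + i] = -v                    # upper off-diagonal block -vI
        M[3 + i][i] = v                     # lower off-diagonal block  vI
    return M

def chi_row(x_next, x_now):
    """Generic row of eq. (29) for homogeneous coordinates [x1, x2, 1]."""
    return [a * b for a in x_next for b in x_now]

R = matmul(rot_x(0.05), rot_y(-0.1))
v = 1.1
St = [[S[j][i] for j in range(3)] for i in range(3)]
RSt, SR = matmul(R, St), matmul(S, R)
Q = [[RSt[i][j] + v * SR[i][j] for j in range(3)] for i in range(3)]   # eq. (22)

M = s_bar(v)
r = vec_rows(R)
q_from_r = [sum(M[i][k] * r[k] for k in range(9)) for i in range(9)]
print(max(abs(a - b) for a, b in zip(q_from_r, vec_rows(Q))))
```

Each row produced by `chi_row`, multiplied by the stacked $Q$, reproduces the bilinear form $x^i(t+1)^T Q\, x^i(t)$ of (20), so that $N$ correspondences give the $N \times 9$ system (28).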

From the above characterizations of the essential matrix constrained by the fixation
hypothesis it is possible to draw some interesting conclusions. In particular, left-multiplying
equation (22) by $[0\ 0\ 1]$ annihilates the second (rightmost) term of its right-hand side,
while right-multiplying by the column vector $[0\ 0\ 1]^T$ annihilates the leftmost term.
From this simple observation we can derive a necessary condition which acts as a consistency
check for the quality of fixation:

$$Q_{33} = 0. \qquad (32)$$

In general, from a number of point matches, we can derive an approximate estimate of the
normalized matrix $Q/\|T\|$ which, due to noise, will be such that $Q_{33} \neq 0$; later in section 4 we
will see how $|Q_{33}|$ gives a measure of how accurate the inner tracking loop is.


2.4 Singularities and normalization of the epipolar representation

In the characterizations of the essential matrices described in the previous section, the
unknown scaling factor has been taken into account by fixing the scalar $\alpha = 1$, and therefore
the matrix $Q$ is uniquely defined. However, there is a continuum of possible motions which
correspond to the essential matrix

$$Q(v, \Omega) = 0 \qquad (33)$$
Figure 5: Epipolar setup. Under the fixation constraint, both the centers of projection at
time t and t+1, and the optical centers of the two cameras lie on the same plane, the epipolar
plane. The intersection of the epipolar plane with the image planes is the epipolar line. The
epipolar plane is invariant after fixation, for the camera can only translate along the plane,
and rotate about a direction orthogonal to it.


in particular

$$v = 1, \quad \Omega = \begin{bmatrix} 0 \\ 0 \\ \theta \end{bmatrix}, \quad \theta \in [0, \pi) \;\Rightarrow\; Q(v, \Omega) = 0 \qquad (34)$$

since $Q = (T\wedge) e^{\Omega\wedge}$ with $T = 0$, and therefore all motions consisting of pure cyclorotation
(rotation about the optical axis or fixation axis) generate a zero essential matrix or an
undefined normalized essential matrix.

If we know that motion occurs only about the optical axis, we can easily estimate the
amount of rotation $\theta$ by solving in a least-squares sense the rigid motion equations (12), which
reduce, in the case of pure cyclorotation, to

$$X^i(t+1) = e^{\left([0\ 0\ 1]^T \wedge\right)\theta}\, X^i(t). \qquad (35)$$

In order to get rid of the singularity just mentioned, we need to normalize the essential
matrices. Since the epipolar constraint is defined up to a scale, it can be arbitrarily multiplied
by a constant. In particular, if we multiply it by $1/\|T\|$ we get rid of the singularity, since the
translation vector $T$ is constrained to be of unit norm. Note that we do not lose any degree
of freedom in the representation, for the scaling does not affect the motion parameters $v$, $\Omega$.
In section 3.3 we will see that this representation affects the convergence of the filter for
estimating  motion  when  away  from  the  singular  configuration.  When  the  object  purely 
rotates  about  the  optical  axis,  the  translation  vector  is  undefined;  we  will  see  in  section  3.3 
how  it  is  possible  to  sort  out  this  situation. 


3  Estimation  from  the  epipolar  constraint 

The  epipolar  constraint,  with  the  addition  of  the  fixation  assumption,  can  be  used  in  order 
to  estimate  the  4  free  parameters  (three  for  rotation  and  one  for  relative  translation  along 
the  fixation  axis).  The  first  solution  we  propose  is  a  closed-form  solution  which  is  correct  in 
the  absence  of  noise,  but  is  far  from  being  efficient  in  the  presence  of  uncertainty,  since  the 
structure  of  the  epipolar  constraint  is  not  imposed  in  the  estimation. 

The  second  solution  is  a  more  correct  one,  for  it  enforces  the  structure  of  the  epipolar 
constraint  during  the  estimation.  It  consists  of  a  dynamic  estimator  in  the  local  coordinates 
of  the  essential  manifold.  The  constraints  are  enforced  by  construction  and  the  structure 
of  the  parameter  manifold  is  exploited,  while  the  computation  is  carried  out  by  an  Implicit 
Extended Kalman Filter (IEKF) along the lines of [11].


3.1  Closed-form,  two-frames  solutions 

Consider $N$ visible points $P^i$, $\forall i = 1 \dots N$, and the $N$ corresponding scalar constraints (20).
The constraints are linear in the components of $Q$, and can be used for estimating a generic
$3 \times 3$ matrix $Q$ which is least-squares compatible with the measurements, in the same way
as [8, 13, 11].

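Once the matrix $Q$ has been estimated, we can derive a set of constraints for the
components of the rotation matrix $R$. Just for the sake of simplicity, assume that we represent
the rotation matrix locally using Euler angles $\alpha \neq 0$, $\beta \neq 0$ and $\gamma \neq 0$:

$$R = R_Z(\alpha) R_Y(\beta) R_Z(\gamma) = \begin{bmatrix} -s_\alpha s_\gamma + c_\alpha c_\beta c_\gamma & -s_\alpha c_\gamma - c_\alpha c_\beta s_\gamma & c_\alpha s_\beta \\ s_\alpha c_\beta c_\gamma + c_\alpha s_\gamma & -s_\alpha c_\beta s_\gamma + c_\alpha c_\gamma & s_\alpha s_\beta \\ -s_\beta c_\gamma & s_\beta s_\gamma & c_\beta \end{bmatrix} \qquad (36)$$

where $R_Z(\alpha)$ indicates a rotation about the $Z$-axis of $\alpha$ radians

$$R_Z(\alpha) = \begin{bmatrix} c_\alpha & -s_\alpha & 0 \\ s_\alpha & c_\alpha & 0 \\ 0 & 0 & 1 \end{bmatrix} \qquad (37)$$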

and similarly for $R_Y(\beta)$ and $R_Z(\gamma)$. From the above expression of $R$, and the expression for
$Q$ given in eq. (27), it is immediate to solve for the Euler angles:

$$\alpha = \arctan\left( -\frac{Q_{13}}{Q_{23}} \right) \qquad (38)$$

$$\beta = \arcsin \sqrt{Q_{31}^2 + Q_{32}^2} \qquad (39)$$

$$\gamma = \arctan\left( \frac{Q_{31}}{Q_{32}} \right) \qquad (40)$$

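provided that $Q_{23} \neq 0$ and $Q_{32} \neq 0$. It is immediate to see that $Q_{23} = Q_{32} = 0$ only if
rotation occurs only about the optical axis with an angle $\theta = \alpha + \gamma$. In such a case, equation
(27) becomes

$$Q = \begin{bmatrix} s_\theta (1 - v) & c_\theta (1 - v) & 0 \\ -c_\theta (1 - v) & s_\theta (1 - v) & 0 \\ 0 & 0 & 0 \end{bmatrix} \qquad (42)$$

and we can solve for $\theta$

$$\theta = \gamma + \alpha = \arctan\left( \frac{Q_{11}}{Q_{12}} \right) \qquad (43)$$

provided that $Q_{12} \neq 0$, in which case we have $\alpha = \beta = \gamma = 0$. Once the rotation parameters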
have  been  estimated,  the  translation  parameter  v  can  be  recovered  from  the  other  elements 
of  Q.  For  instance,  when  =  0, 


Alternatively, one may start with a different local coordinate parametrization of R, for
example the exponential coordinates

$$R = e^{\Omega\wedge} \qquad (45)$$

and plug the result into equation (22), which can then be solved for the three unknowns
$\Omega_1, \ldots, \Omega_3$ using an iterative optimization method such as gradient descent.

It must be stressed that these methods do not enforce the structure of the parameter space
during the estimation process. Rather, generic, non-structured parameters are estimated,
and then their structure is imposed a posteriori in order to recover an approximation of the
desired estimates.
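
As a rough illustration of the closed-form recovery of the Euler angles, the sketch below builds Q from known parameters and then inverts it. It assumes $Q = T\wedge R$ with $T = [0\ 0\ v]^T - R\,[0\ 0\ 1]^T$; this convention is an assumption chosen because it reproduces eq. (42), while the paper's own definition is eq. (27), which is not repeated here. The function names are hypothetical, and the angles are taken in the range where the principal values of arctan and arcsin apply.

```python
import math

def rot_z(a):
    c, s = math.cos(a), math.sin(a)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def rot_y(b):
    c, s = math.cos(b), math.sin(b)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def essential_from_euler(alpha, beta, gamma, v):
    """Assumed convention: Q = hat(T) R, R = Rz(alpha) Ry(beta) Rz(gamma),
    T = [0, 0, v] - R [0, 0, 1]."""
    R = matmul(matmul(rot_z(alpha), rot_y(beta)), rot_z(gamma))
    T = [-R[0][2], -R[1][2], v - R[2][2]]
    T_hat = [[0.0, -T[2], T[1]], [T[2], 0.0, -T[0]], [-T[1], T[0], 0.0]]
    return matmul(T_hat, R)

def euler_from_essential(Q):
    """Invert Q for (alpha, beta, gamma, v); requires Q23 != 0 and Q32 != 0."""
    alpha = math.atan(-Q[0][2] / Q[1][2])
    beta = math.asin(math.sqrt(Q[2][0] ** 2 + Q[2][1] ** 2))
    gamma = math.atan(Q[2][0] / Q[2][1])
    # Under the assumed convention Q23 = v * cos(alpha) * sin(beta).
    v = Q[1][2] / (math.cos(alpha) * math.sin(beta))
    return alpha, beta, gamma, v

Q = essential_from_euler(0.3, 0.2, -0.4, 1.1)
alpha, beta, gamma, v = euler_from_essential(Q)
```

With noisy data the estimated Q would not have exactly this structure, which is the limitation of the closed-form approach discussed above.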

The  epipolar  constraints  can  also  be  used  for  formulating  nonlinear  filters  that  estimate 
the  motion  components  over  time,  while  taking  into  account  the  geometry  of  the  parameter 
space.  This  is  done  in  the  next  section. 

3.2  Implicit  dynamical  filter  for  motion  from  fixation 

Consider the local parametrization of the essential matrix $Q(\Omega, v)$, which is

$$\xi = \begin{bmatrix} v \\ \Omega \end{bmatrix} \in \mathbb{R}^4 \qquad (46)$$

where $\Omega \in \mathbb{R}^3$ is defined for $\|\Omega\| \in [0, \pi)$ by the equation [9]

$$e^{\Omega\wedge} = R. \qquad (47)$$

We can write a dynamic model in the local coordinates of the essential manifold, having as
implicit measurement constraints the epipolar constraint (20), where the matrix Q is now
expressed as a function of the local coordinates, $Q(\xi)$:

$$\begin{cases}
x^i(t+1)^T Q(\xi(t))\, x^i(t) = 0 & \xi \in \mathbb{R}^4 \\
y^i(t) = x^i(t) + n^i(t) & \forall\, i = 1 \ldots N.
\end{cases} \qquad (48)$$


Estimating motion amounts to identifying the parameters $\xi$ from the above model. This can
be done using the local identification procedure presented in [11], which is the IEKF based
upon the model

$$\begin{cases}
\xi(t+1) = \xi(t) + n_\xi(t) \\
y^i(t+1)^T Q(\xi(t))\, y^i(t) = \tilde{n}^i(t) & \forall\, i = 1 \ldots N
\end{cases} \qquad (49)$$

where the second order statistic of the residual $\tilde{n}$ is computed according to [11]. An
alternative way of writing the above model is

$$\begin{cases}
\xi(t+1) = \xi(t) + n_\xi(t) \\
\chi(t)\, S(\xi_1)\, R(\xi_2, \xi_3, \xi_4) = 0.
\end{cases} \qquad (50)$$

The equations of the estimator, as derived from [11], are:


prediction step:

$$\hat{\xi}(t+1|t) = \hat{\xi}(t|t), \qquad \hat{\xi}(0|0) = \xi_0 \qquad (51)$$

$$P(t+1|t) = P(t|t) + Q_\xi \qquad (52)$$

where $Q_\xi$ is the variance of the noise $n_\xi$ driving the random walk model and is intended
as a tuning parameter, and P is the variance of the estimation error of the filter.

update step:

$$\hat{\xi}(t+1|t+1) = \hat{\xi}(t+1|t) + L(t+1)\, y(t+1)^T Q(\hat{\xi}(t+1|t))\, y(t) \qquad (53)$$

$$P(t+1|t+1) = \Gamma(t+1) P(t+1|t) \Gamma^T(t+1) + L(t+1) R_n L^T(t+1) \qquad (54)$$

where $L(t+1)$ is the Extended Kalman Gain [7], and $\Gamma = I - LC$, with C obtained by
differentiating the measurement constraint with respect to $\xi$ at $\hat{\xi}(t+1|t)$.
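
The prediction and update steps above can be sketched in a few lines. This is only an illustrative sketch under several assumptions that are not taken from the paper: the state is ordered as $\xi = (v, \Omega_1, \Omega_2, \Omega_3)$; $Q(\xi) = T\wedge R$ with $T = [0\ 0\ v]^T - R\,[0\ 0\ 1]^T$; the N scalar constraints are processed one at a time rather than as a batch; the residual variance `r_var` is a fixed tuning constant instead of being computed as in [11]; and C is a numerical Jacobian taken with a positive sign, so the correction enters with a minus sign. All function names are hypothetical.

```python
import math

def hat(w):
    """Skew-symmetric matrix of w, so that hat(w) x is the cross product w x x."""
    return [[0.0, -w[2], w[1]], [w[2], 0.0, -w[0]], [-w[1], w[0], 0.0]]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def exp_so3(w):
    """Rodrigues' formula for R = exp(hat(w))."""
    th = math.sqrt(sum(x * x for x in w))
    a = math.sin(th) / th if th > 1e-9 else 1.0
    b = (1.0 - math.cos(th)) / th ** 2 if th > 1e-9 else 0.5
    W = hat(w)
    W2 = matmul(W, W)
    return [[float(i == j) + a * W[i][j] + b * W2[i][j] for j in range(3)]
            for i in range(3)]

def essential(xi):
    """Assumed form: Q = hat(T) R, xi = (v, w1, w2, w3), T = [0,0,v] - R[0,0,1]."""
    R = exp_so3(xi[1:4])
    T = [-R[0][2], -R[1][2], xi[0] - R[2][2]]
    return matmul(hat(T), R)

def residual(xi, y0, y1):
    """Implicit measurement y(t+1)^T Q(xi) y(t) for one correspondence."""
    Q = essential(xi)
    return sum(y1[i] * Q[i][j] * y0[j] for i in range(3) for j in range(3))

def jacobian(xi, y0, y1, h=1e-6):
    """Central-difference gradient of the residual with respect to xi."""
    C = []
    for k in range(4):
        up, dn = list(xi), list(xi)
        up[k] += h
        dn[k] -= h
        C.append((residual(up, y0, y1) - residual(dn, y0, y1)) / (2.0 * h))
    return C

def iekf_step(xi, P, pairs, q_var, r_var):
    # Prediction: random-walk model, the state is unchanged and P grows by q_var.
    P = [[P[i][j] + (q_var if i == j else 0.0) for j in range(4)] for i in range(4)]
    # Update: one scalar epipolar constraint at a time (no matrix inversion needed).
    for y0, y1 in pairs:
        C = jacobian(xi, y0, y1)
        PCt = [sum(P[i][j] * C[j] for j in range(4)) for i in range(4)]
        s = sum(C[i] * PCt[i] for i in range(4)) + r_var
        L = [p / s for p in PCt]
        e = residual(xi, y0, y1)
        xi = [xi[i] - L[i] * e for i in range(4)]
        G = [[float(i == j) - L[i] * C[j] for j in range(4)] for i in range(4)]
        Gt = [[G[j][i] for j in range(4)] for i in range(4)]
        GPGt = matmul(matmul(G, P), Gt)
        P = [[GPGt[i][j] + L[i] * r_var * L[j] for j in range(4)] for i in range(4)]
    return xi, P

def simulate_pairs(xi, points):
    """Noise-free projections at t and t+1 of points given in the camera frame,
    with the fixation point at [0, 0, 1] at time t."""
    R = exp_so3(xi[1:4])
    pairs = []
    for X in points:
        d = [X[0], X[1], X[2] - 1.0]
        X1 = [sum(R[i][k] * d[k] for k in range(3)) for i in range(3)]
        X1[2] += xi[0]
        pairs.append(([X[0] / X[2], X[1] / X[2], 1.0],
                      [X1[0] / X1[2], X1[1] / X1[2], 1.0]))
    return pairs
```

A filter run would call `iekf_step` once per frame pair, feeding the returned state and covariance back in as the next initial condition.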


3.3  Dealing  with  singularities  in  the  representation 

In  section  2.4  we  have  pointed  out  a  singularity  in  the  non-normalized  epipolar  representation 
when  the  relative  motion  between  the  scene  and  the  object  consists  of  pure  rotation  about 
the  optical  axis.  This  phenomenon  is  to  be  expected,  for  pure  rotation  about  the  optical 
axis  generates  zero  ego-motion  translation 


$$T = \begin{bmatrix} 0 \\ 0 \\ v \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix} \qquad (55)$$

which is a singular configuration for the motion estimation problem [11]. As long as there
is a non-zero translation (that is, as long as there is some component of rotation about
an axis other than the optical axis), the constraints are well-defined. However,
serious problems may occur while estimating motion even when the motion parameters are
far away from the singular point.

In order to visualize that, we can imagine the innovation of the filter as living on a
residual surface that maps some particular motion $v, \Omega$ onto $\mathbb{R}^N$ when N feature points are
visible. The filter will try to update the state $\xi$ so as to reach the minimum of the residual.
Of course the motion that generated the data, $v, \Omega$, corresponds to a minimum of the residual
surface (it would be zero in absence of noise). However, the location $v = 1$, $\Omega = [0\ 0\ \theta]^T$
also corresponds to a zero of the residual, which is a hole in the residual surface. Therefore the
filter must be able to reach the minimum without falling into the singularity (see figure 6).

This can be done provided that the initial conditions are close to the minimum of the
residual surface corresponding to the true motion. However, in the presence of high measurement
noise levels, the residual surface becomes increasingly irregular, and eventually
the filter falls into the singularity. This effect will be illustrated in the experimental section,
where we will show that in the presence of high noise levels, the filter initialized far enough
from the true value of the state falls into the singularity, the innovation goes to zero and the
variance of the state increases.


Figure 6: Singularity in the non-normalized epipolar representation. The residual surface,
where the innovation of the filter takes values, has a minimum corresponding to the true motion,
but also a minimum corresponding to cyclorotation. The filter must be able to converge
to the true minimum without falling into the singularity. The normalized epipolar representation
is a way of getting rid of the singularity, for the translation vector is constrained to
have unit norm.

One way of getting rid of this singularity is to use the normalized essential matrix, which
corresponds to dividing the epipolar constraint by the norm of translation. This eliminates
the singularity, since T is constrained to have unit norm. However, the motion corresponding
to pure cyclorotation gives an essential matrix which is undefined, and therefore the filter
will give arbitrary estimates.

In order to sort out the case of pure rotation about the optical axis, we can first try to
fit a $\theta$ to the purely cyclo-rotational model

$$x(t+1) = R_z(\theta)\, x(t). \qquad (56)$$

If  the  residual  is  big  enough  it  means  that  rotation  is  not  purely  about  the  optical  axis. 
Therefore  the  translation  induced  in  the  viewer’s  reference  is  non-zero,  and  the  normalized 
epipolar  constraint  is  well-defined.  We  will  see  in  the  experimental  section  how  the  filter 
based  upon  the  normalized  epipolar  representation  performs  where  the  non-normalized  filter 
would  fall  into  the  singularity. 
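
The pre-test on the cyclo-rotational model (56) can be sketched as a closed-form least-squares fit of a planar rotation to the image correspondences, followed by a check on the residual. The function name and the use of the root-mean-square residual as the decision quantity are assumptions made for illustration; the paper does not specify how the residual is thresholded.

```python
import math

def fit_cyclorotation(pairs):
    """Least-squares theta for x(t+1) = Rz(theta) x(t).

    pairs: list of ((x0, y0), (x1, y1)) image coordinates at t and t+1.
    Returns (theta, rms_residual).
    """
    # Maximizing sum of p1 . (Rz(theta) p0) gives theta = atan2(s, c).
    s = sum(x0 * y1 - y0 * x1 for (x0, y0), (x1, y1) in pairs)
    c = sum(x0 * x1 + y0 * y1 for (x0, y0), (x1, y1) in pairs)
    theta = math.atan2(s, c)
    ct, st = math.cos(theta), math.sin(theta)
    sq = sum((x1 - (ct * x0 - st * y0)) ** 2 + (y1 - (st * x0 + ct * y0)) ** 2
             for (x0, y0), (x1, y1) in pairs)
    return theta, math.sqrt(sq / len(pairs))

def rotate(p, theta, scale=1.0):
    ct, st = math.cos(theta), math.sin(theta)
    return (scale * (ct * p[0] - st * p[1]), scale * (st * p[0] + ct * p[1]))

pts = [(0.1, 0.2), (-0.3, 0.1), (0.2, -0.25), (-0.1, -0.15)]
pure = [(p, rotate(p, 0.1)) for p in pts]          # pure cyclorotation
other = [(p, rotate(p, 0.1, 1.2)) for p in pts]    # rotation plus image expansion
theta_pure, rms_pure = fit_cyclorotation(pure)
theta_other, rms_other = fit_cyclorotation(other)
```

A small residual suggests that the motion is close to the singular configuration, while a residual well above the measurement noise level suggests that the normalized epipolar constraint is well-defined.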


4  Vergence  control,  quality  of  fixation  and  sensitivity 
of  constraints 

One may argue that, in the proposed architecture, the estimation scheme that follows the
fixation loop is “blind”, in the sense that it cannot reject disturbances due to imperfect
tracking. In the present section we analyze how the estimation algorithm is modified in the
presence of non-perfect tracking, and how it can assess the quality of the fixation.

We will consider two different kinds of non-perfect tracking: one in which the two optical
axes (at time t and t + 1) intersect at a point which is not the desired fixation point, and
one in which the two optical axes do not intersect at all.

4.1  Vergence  control 

Let us assume that the optical axis of the camera at time t intersects the optical axis at
time t + 1 in a “vergence point” which is different from the desired fixation point (see
figure 5). Consider the plane determined by the two centers of projection and the optical
center (fixation point) in the camera at time t, which is called the epipolar plane at time
t. If the optical axes intersect, there must exist one point on the projection of the optical
axis of the camera at time t which passes through the optical center of the camera at time
t + 1. Equivalently, the optical center at time t + 1 must belong to the epipolar plane. It is
immediate to see that this can happen only if the direction of rotation is orthogonal to the
direction of translation, which is constrained to belong to the epipolar plane (see fig. 5). In
brief, the epipolar plane is invariant under the vergence conditions.

Therefore, under the vergence conditions, we can identify one point $P_0$ at the intersection
of the optical axes, for which the fixation constraint is satisfied, although it is not the desired
fixation point. From Chasles’ theorem [9] we can conclude that the algorithm proposed in
the previous section estimates the motion of the object relative to the point $P_0$, rather than
relative to the desired fixation point. If the mismatch between the target point and the
actual vergence point is $\epsilon$ along the epipolar line, then the mismatch along the optical axis is
approximately $d\epsilon$, where d is the distance between the optical center and the target fixation
point.

A  natural  question  to  ask  at  this  point  is  how  the  algorithm  following  the  fixation  loop 
can  verify  whether  the  vergence  conditions  are  satisfied  and,  if  they  are  not,  send  a  feedback 
signal  to  the  fixation  loop. 

4.2  Vergence  conditions,  quality  of  fixation 

When the optical axes do not intersect, the epipolar constraint is not satisfied for the optical
center. The vergence constraint between two time instants can be expressed by saying that

the two optical axes intersect $\Leftrightarrow$ $\exists\, X_0$ such that $x_0(t) = [0\ 0\ 1]^T \Rightarrow x_0(t+1) = [0\ 0\ 1]^T$.

It is immediate to verify that the above conditions hold if and only if the direction of
translation is orthogonal to the direction of rotation. Indeed, a more synthetic condition
can be derived by observing that

the optical axes intersect $\Leftrightarrow$ $Q_{33} = 0$.

In fact, clearly if the optical axes intersect, the optical center $x_0$ must satisfy the epipolar
constraint:

$$x_0(t+1)^T Q\, x_0(t) = 0 \Rightarrow Q_{33} = 0. \qquad (57)$$

Vice versa, assume that $\forall x_0$, the condition $x_0(t+1) \neq [0\ 0\ 1]^T$ implies $x_0(t) \neq [0\ 0\ 1]^T$ while
$Q_{33} = 0$. Write $x_0(t+1)$ as $[\alpha\ \beta\ 1]^T$ with $\alpha\beta \neq 0$. Then the epipolar constraint must be
violated for all correspondence points of the form $[0\ 0\ 1]^T$:

$$[\alpha\ \beta\ 1]\, Q\, [0\ 0\ 1]^T \neq 0 \Rightarrow \alpha Q_{13} + \beta Q_{23} + Q_{33} \neq 0. \qquad (58)$$

If $Q_{13} = Q_{23} = 0$, then we conclude that $Q_{33} \neq 0$, which is a contradiction. If at least
one of $Q_{13}, Q_{23} \neq 0$, by choosing $\alpha = -Q_{23}$, $\beta = Q_{13}$, we conclude again $Q_{33} \neq 0$, which
contradicts the hypotheses.

Therefore,  when  the  vergence  conditions  are  not  satisfied  and  the  optical  axes  do  not 
intersect, the scalar $|Q_{33}|$ is a measure of the quality of vergence. From a geometrical point
of  view,  Q33  is  the  volume  of  the  parallelepiped  with  sides  equal  to  the  translation  vector, 
the  optical  axis  of  the  camera  at  time  t  and  the  one  at  time  t  +  1. 
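
The role of $Q_{33}$ as a vergence measure can be checked numerically. The sketch below assumes $Q = T\wedge R$ with $T = [0\ 0\ v]^T - R\,[0\ 0\ 1]^T$ for a fixating motion (an assumed convention, consistent with eq. (42)), and adds a hypothetical lateral drift to T to break the vergence condition. With this convention $Q_{33}$ is the triple product of the optical axis at time t + 1, the translation T, and the rotated optical axis at time t.

```python
import math

def hat(w):
    """Skew-symmetric matrix of w, so that hat(w) x is the cross product w x x."""
    return [[0.0, -w[2], w[1]], [w[2], 0.0, -w[0]], [-w[1], w[0], 0.0]]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def exp_so3(w):
    """Rodrigues' formula for R = exp(hat(w))."""
    th = math.sqrt(sum(x * x for x in w))
    a = math.sin(th) / th if th > 1e-9 else 1.0
    b = (1.0 - math.cos(th)) / th ** 2 if th > 1e-9 else 0.5
    W = hat(w)
    W2 = matmul(W, W)
    return [[float(i == j) + a * W[i][j] + b * W2[i][j] for j in range(3)]
            for i in range(3)]

def q33(omega, v, drift=(0.0, 0.0)):
    """Element (3,3) of Q = hat(T) R for a fixating motion plus a lateral drift.

    drift is a hypothetical translation along X and Y that violates fixation.
    """
    R = exp_so3(omega)
    T = [drift[0] - R[0][2], drift[1] - R[1][2], v - R[2][2]]
    return matmul(hat(T), R)[2][2]

fixating = q33([0.1, -0.2, 0.05], 1.1)                      # optical axes intersect
drifting = q33([0.1, -0.2, 0.05], 1.1, drift=(0.02, 0.0))   # axes no longer intersect
```

In this sketch `fixating` vanishes while `drifting` does not, which is the behavior that would make $|Q_{33}|$ usable as a feedback signal.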

Since at each step we can estimate the matrix Q from all the visible points, we could use
$Q_{33}$ as a sensory signal to be fed back to the fixation loop. This would allow us to design
a vergence control that exploits all the visible features, rather than the projection of the
fixation point alone. This issue is not explored in the present paper and is an object of
future research.

4.3 Sensitivity and degradation of the constraint

In  the  previous  sections  we  have  treated  the  problem  of  motion  estimation  as  an  identification 
task  where  the  class  of  models  was  determined  by  the  epipolar  constraint  under  the  fixation 
assumption. We now want to ask ourselves: suppose the actual process generating the data
does  not  exactly  fall  within  the  given  class  of  models,  how  do  small  deviations  from  the  class 
affect  the  quality  of  the  estimates? 

More  specifically,  suppose  that  our  camera  is  not  tracking  exactly  the  fixation  point.  The 
measurements  we  get  from  the  image  plane  do  not  satisfy  the  epipolar  constraint  of  eq.  (22) 
for  any  choice  of  the  parameters.  However,  if  the  deviation  from  the  constraints  is  small,  we 
would  like  our  estimates  to  deviate  little  from  the  true  motion  parameters. 

Suppose that our measurements are generated for an object which rotates about the
fixation point with $\Omega$, translates along the fixation axis by $v$ and also drifts away from the
fixation point with some velocities $\epsilon_1$ and $\epsilon_2$ along X and Y respectively. Therefore the
model generating the data looks like

$$X^i(t+1, \epsilon) = R(\Omega) \left( X^i(t) - \begin{bmatrix} 0 \\ 0 \\ d(t) \end{bmatrix} \right) + \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ d(t+1) \end{bmatrix} \qquad (59)$$

where we measure

$$x^i(t, \epsilon) = \pi(X^i(t, \epsilon)) \qquad (60)$$

which we collect into the matrix

$$\chi(t, \epsilon) \qquad (61)$$

as in equation (29). For $\epsilon = 0$ the epipolar constraint is satisfied by the actual motion
parameters $v, \Omega$:

$$\chi(t, 0)\, S(v)\, R(\Omega) = 0 \qquad (62)$$

where S and R are a 9 x 9 matrix and a 9-vector defined as in (30). However, in the presence
of disturbances $\epsilon$, there is no element in the class of models that satisfies the constraints, i.e.

$$\forall\, \epsilon > 0, \quad \chi(t, \epsilon)\, S(v)\, R(\Omega) \neq 0 \quad \forall\, v,\ \forall\, \Omega \in \mathbb{R}^3. \qquad (63)$$

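At this point, assuming $\epsilon$ small, we may seek the perturbations $\hat{v} = v - \delta v$ and $\hat{\Omega} = \Omega - \delta\Omega$
that make the above residual zero up to second order terms:

$$\delta v,\ \delta\Omega = \arg\min \| \chi(t, \epsilon)\, S(v - \delta v)\, R(\Omega - \delta\Omega) \| \qquad (64)$$

This is essentially the task of the recursive filter described in the previous sections, where
the process to be minimized is the innovation. Expanding around the zero-perturbation
conditions, we have

$$\begin{aligned}
\chi(t, \epsilon)\, S(v - \delta v)\, R(\Omega - \delta\Omega) = {} & \chi(t, 0) S(v) R(\Omega) + \frac{\partial \chi}{\partial \epsilon_1} S(v) R(\Omega)\, \epsilon_1 + \frac{\partial \chi}{\partial \epsilon_2} S(v) R(\Omega)\, \epsilon_2 \\
& - \chi(t, 0) \frac{\partial S}{\partial v}(v) R(\Omega)\, \delta v - \chi(t, 0) S(v) \frac{\partial R}{\partial \Omega}(\Omega)\, \delta\Omega + O(\epsilon^2, \delta v^2, \delta\Omega^2).
\end{aligned} \qquad (65)$$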




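We can now find the perturbations $\delta v = \delta v(\epsilon, v, \Omega)$ and $\delta\Omega = \delta\Omega(\epsilon, v, \Omega)$ that make the
residual zero up to higher order terms from

$$\begin{bmatrix} \frac{\partial \chi}{\partial \epsilon_1} S R & \frac{\partial \chi}{\partial \epsilon_2} S R \end{bmatrix} \epsilon = \chi(t, 0) \begin{bmatrix} \frac{\partial S}{\partial v} R & S \frac{\partial R}{\partial \Omega} \end{bmatrix} \begin{bmatrix} \delta v \\ \delta\Omega \end{bmatrix} \qquad (66)$$

which we will write as

$$B(v, \Omega)\, \epsilon = A(v, \Omega) \begin{bmatrix} \delta v \\ \delta\Omega \end{bmatrix} \qquad (67)$$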


The N x 4 matrix A loses normal column rank only at the singular configuration $v = 1$,
$\Omega = [0\ 0\ \theta]^T$ for all $\theta \in [0, \pi)$. However, this configuration does not belong to the state-space
of the filter, for it has been eliminated by the normalization constraint. Therefore we can
conclude

$$\begin{bmatrix} \delta v \\ \delta\Omega \end{bmatrix} = (A^T A)^{-1} A^T B\, \epsilon = C(v, \Omega)\, \epsilon \qquad (68)$$

and the induced norm of the matrix $C(v, \Omega)$ is a measure of the “gain” between (small)
disturbances in the constraints (or drifts outside the model class) and the errors in the
estimates. In the experimental section we will show the result of a simulation where the
disturbance level was increased up to the point in which the filter based upon the fixation
constraint did not converge.


5  Attitude  estimation  from  fixation 

In some cases it may be desirable to reconstruct not only the relative velocity between the
object being fixated and the viewer, but also their relative configuration, along the lines of [1].
Of course the relative configuration, assuming the initial time as the base frame, can be
obtained by integrating velocity information, and this is indeed the only feasible solution
when the motion of the viewer induces drastic changes in the image, such as occlusion,
appearance of new objects, etc.

While in most applications the scene changes significantly and we cannot assume that
the same features are visible over extended periods of time, in the case of fixation we can
assume that the object stays in the field of view and we can integrate structure information
from the same features to the extent to which they are visible.

Notice  that,  while  in  all  the  previous  cases  involving  estimation  of  velocity  (or  relative 
configuration  in  the  moving  frame),  we  could  decouple  the  motion  parameters  from  the 
structure  and  therefore  formulate  filters  involving  only  motion  parameters  and  measured 
projections,  in  the  case  of  the  absolute  orientation,  it  is  necessary  to  include  structure  in  the 
state  of  the  filter. 

The fixation assumption gives the strong constraint that the object being fixated
rotates about the fixation point and translates along the fixation axis. As a result,
the object remains in the field of view as long as we fixate it. Therefore we
will adopt an object-centered model, where the coordinates of each point are constant over
time:

$${}^{o}P^i = \text{const}. \qquad (69)$$

Since we measure the projection of the coordinates of the point in the reference frame of
the camera, we can enforce that the coordinates relative to the camera reference at the first
instant are constant:

$${}^{t_0}P^i = \text{const} \qquad (70)$$


which relates to the measured projection via

where ${}^{t}R_{t_0}$ is the relative orientation between the viewer reference at time t and the same
reference frame at the initial time $t_0$.

We may conceive at this point a dynamic model having the trivial constant dynamics of
the points in the state, and the above projection as the measurement constraint. In order
to do so, we have to insert ${}^{t}R_{t_0}$ and $d(t)$, along with their derivatives, into the state of the
filter, which therefore becomes $(3N + 8)$-dimensional:

$${}^{t_0}P^i(t+1) = {}^{t_0}P^i(t) \qquad {}^{t_0}P^i(0)\ \ldots$$

where $\pi$ denotes an ideal perspective projection. In the case of weak perspective, the last
measurement equation transforms into

There is an additional constraint that can be imposed in order to set the overall scaling,
which is


$${}^{t_0}P^0(t)\ \ldots\ \forall t. \qquad (74)$$

The above can be imposed either as a measurement constraint, or as a model constraint by
setting the variance of the corresponding state to zero, as in [1].

The above model may be reduced to a minimal one by removing the dynamics of the
absolute orientation $d(t), R(t)$, and by exploiting the fact that

Since we measure the initial projection of each feature point, we can leave only the scaling
(initial depth) $Z_0^i$ in the state. It must be noticed, however, that the error in the location
of the initial features is propagated through time, since we do not update the states corresponding
to the measured projections. If one is willing to trade the drift due to the initial
measurement error for the elimination of 2N states from the model, one ends up with the following
system

$$\begin{cases}
Z_0^i(t+1) = Z_0^i(t) & Z_0^i(0) = 1 \\
\Omega(t+1) = \Omega(t) + n_\Omega(t) & \Omega(0) = 0 \\
\ldots \\
Z_0^0(t) = 1
\end{cases}$$


where $R(t)$ and $d(t)$ are computed from the states $\Omega(t)$ and $v(t)$ at each time by integrating

$$\begin{cases}
R(t+1)\, R^T(t) = e^{\Omega(t)\wedge} & R(0) = I \\
d(t+1) = d(t) + v(t) & d(0) = 1.
\end{cases}$$

A simple EKF based upon the model above recovers the structure modulo the initial distance
from the fixation point $d_0$. If such a distance is known, it is possible to recover the full
structure, as well as the motion parameters $\Omega(t)$ and $v(t)$.
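
The integration of the attitude from the velocity states can be sketched as follows. The sketch reads the integration rule as $R(t+1) = e^{\Omega(t)\wedge} R(t)$ and $d(t+1) = d(t) + v(t)$, starting from $R(0) = I$ and $d(0) = 1$; the function names are hypothetical and the input sequences are made up for illustration.

```python
import math

def hat(w):
    """Skew-symmetric matrix of w, so that hat(w) x is the cross product w x x."""
    return [[0.0, -w[2], w[1]], [w[2], 0.0, -w[0]], [-w[1], w[0], 0.0]]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def exp_so3(w):
    """Rodrigues' formula for R = exp(hat(w))."""
    th = math.sqrt(sum(x * x for x in w))
    a = math.sin(th) / th if th > 1e-9 else 1.0
    b = (1.0 - math.cos(th)) / th ** 2 if th > 1e-9 else 0.5
    W = hat(w)
    W2 = matmul(W, W)
    return [[float(i == j) + a * W[i][j] + b * W2[i][j] for j in range(3)]
            for i in range(3)]

def integrate_attitude(omegas, vs):
    """Accumulate R(t) and d(t) from per-step rotation and translation states."""
    R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # R(0) = I
    d = 1.0                                                  # d(0) = 1
    for w, v in zip(omegas, vs):
        R = matmul(exp_so3(w), R)   # R(t+1) = exp(hat(Omega(t))) R(t)
        d = d + v                   # d(t+1) = d(t) + v(t)
    return R, d

# Ten steps of 0.1 rad about the optical axis while receding by 0.05 per step.
R, d = integrate_attitude([[0.0, 0.0, 0.1]] * 10, [0.05] * 10)
```

Because errors in the velocity estimates accumulate through this integration, the resulting attitude drifts over time unless structure is also included in the state, as discussed above.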


6  Experiments 

6.1  Experimental  conditions 

In order to test the effectiveness of the schemes proposed, and compare them against equivalent
motion estimation techniques that do not take into account the fixation constraint, we have
generated a cloud of dots within a cubic volume at d = 2 m in front of the viewer. These
dots are projected onto an ideal image plane with unit focal length and 500 x 500 pixels,
corresponding to a visual angle of approximately 30°. Noise with a standard deviation of 1 pixel
has been added to the projections, corresponding to the average performance of current feature tracking
techniques [2]. One random point in the cloud is chosen as the fixation point, and the cloud
is then made to rotate about this point and translate along the fixation axis with smooth but
non-constant velocity.
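
A setup of this kind can be sketched as below. The side of the cube (0.5 m), the random seeds, and the function names are assumptions made for illustration; only the depth of 2 m, the unit focal length, the 500 x 500 image, the 30° visual angle and the 1 pixel noise level come from the description above.

```python
import math
import random

PIXELS = 500                  # image is 500 x 500 pixels
FOV_DEG = 30.0                # approximate visual angle
# Size of one pixel on the image plane at unit focal length.
PIXEL_SIZE = 2.0 * math.tan(math.radians(FOV_DEG / 2.0)) / PIXELS

def make_cloud(n, depth=2.0, side=0.5, seed=0):
    """Random dots in a cube of the given side (assumed), centered at `depth`."""
    rng = random.Random(seed)
    half = side / 2.0
    return [[rng.uniform(-half, half), rng.uniform(-half, half),
             depth + rng.uniform(-half, half)] for _ in range(n)]

def project(points, noise_pixels=1.0, seed=1):
    """Ideal perspective projection plus Gaussian noise given in pixels."""
    rng = random.Random(seed)
    sigma = noise_pixels * PIXEL_SIZE
    return [[X[0] / X[2] + rng.gauss(0.0, sigma),
             X[1] / X[2] + rng.gauss(0.0, sigma)] for X in points]

cloud = make_cloud(50)
fixation_point = random.Random(2).choice(cloud)   # one random point is fixated
noisy = project(cloud, noise_pixels=1.0)
clean = project(cloud, noise_pixels=0.0)
```

With these numbers one pixel corresponds to roughly 0.001 units on the image plane, which gives a sense of the scale of the measurement noise relative to the image coordinates.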

6.2  Recursive  filters 

In  figure  7  (top-left),  the  4  components  of  the  state  of  the  filter  described  in  section  3.2 
are  plotted,  along  with  the  ground  truth  in  dotted  lines.  The  plot  on  the  right  shows  the 
absolute  estimation  error. 

The same data have been fed to the essential filter [11], which estimates 5 states corresponding
to the direction of translation and the rotational velocity without enforcing the
fixation constraint. The states corresponding to the same motion described above, as long

Figure 7: (top-left) Estimates of the 4-dimensional state of the filter for estimating relative
orientation under the fixation constraint. Filter estimates are in solid lines, while ground
truth is in dotted lines. The estimation error (top-right) is smooth and strongly correlated,
which is a symptom of poor tuning of the filter. If we do not enforce the fixation constraint,
we need to estimate 5 motion parameters. The filter which does not enforce the fixation
constraint converges faster (bottom-left) and the estimation error is larger but far less correlated
(bottom-right), which indicates that the potential limits of the scheme have been
achieved.

In our preliminary set of experiments we have observed a higher robustness level in the
filter enforcing the fixation constraint. For example, the maximum noise level tolerable by
the filter not enforcing the fixation constraint in this particular experimental setup is 1.5
pixels, while the filter enforcing fixation performs up to 2.5 pixels, as reported in figure 8.


6.3  Attitude  estimation 

In figure 9 we report the estimates of the absolute orientation and structure as estimated by
the filter described in section 5. The structure parameters (initial depth of all points) have
been plotted against the true parameters, assuming that the initial distance of the fixation
point is known. In general, structure can be recovered only up to a scale factor. The four
motion components are also plotted, along with the estimation error, in the right plot.

Figure 8: (left) Convergence of the states of the filter enforcing the fixation constraint for a
noise level in the feature tracking of 3 pixels. The filter that does not enforce the fixation
constraint does not converge in the same experimental situation. Initial conditions, tuning
of the filters and noise levels are the same for both filters.

It must be noticed that this filter has an (N + 4)-dimensional state, unlike the one described
above  which  has  dimension  4.  Furthermore,  the  filter  has  proven  very  sensitive  to  the  initial 
conditions  in  the  motion  parameters,  while  the  structure  parameters  can  be  safely  initialized 
to  1,  which  corresponds  to  having  the  visible  objects  flat  on  the  image  plane.  The  error 
is  significantly  correlated  and  convergence  is  slow  for  the  motion  parameters,  which  are 
observable  only  through  2  levels  of  bracketing  with  the  state  equation. 

In  case  occlusions  occur  in  the  image  plane  or  some  features  disappear  or  exit  the  field  of 
view,  it  is  necessary  to  resort  to  the  schemes  described  in  section  3.2,  unless  we  are  willing 
to  deal  with  a  filter  with  a  variable  number  of  states. 

6.4  Singularities  and  normalization 

As we have mentioned in section 2.4, the non-normalized epipolar representation contains a
singularity at $v = 1$, $\Omega = [0\ 0\ \theta]^T$, where the innovation of the filter becomes zero. Therefore,
even when motion does not correspond to pure rotation about the optical axis (the singular
configuration), the filter may converge to the singular configuration whenever initialized far
enough from the true state. In particular, when the noise level increases, the residual surface
becomes more and more irregular, and it becomes easier for the filter to fall into the singular
configuration.

In figure 10 (left) we show the state of the filter that is initialized far from the true
initial conditions for a measurement noise level of 1 pixel. The filter converges to a state
corresponding to $v = 1$ and $\Omega = [0\ 0\ \theta]^T$ for some $\theta$. Correspondingly, the innovation goes
to zero (fig. 10 right) and the filter saturates. The variance of the estimation error keeps
increasing after the filter has saturated. In figure 10 (bottom) we plot the state with errorbars
corresponding to the diagonal elements of the variance/covariance matrix of the estimation
error. It can be seen that, after the variance decreases due to the initial convergence towards
the minimum, it keeps increasing steadily once the filter has saturated.

Figure 9: (top-left) Estimates of the (N + 4)-dimensional state of the filter for estimating
absolute orientation and structure. Success in the estimation process depends crucially on
the initial conditions of the motion parameters (bottom-left), while the structure parameters
can be safely initialized to 1, which corresponds to having the visible objects flat on the
image plane. The estimation error (top-right) is strongly correlated and decays slowly. The
estimation error for the motion parameters, initialized within 1% of the true values, is
plotted in (bottom-right) for comparison with the relative motion estimation scheme.

When  the  same  initial  conditions  and  noise  levels  are  applied  to  the  filter  based  upon  the 
normalized  essential  matrices,  convergence  is  achieved  without  any  problems  of  saturation 
(figure  11). 

6.5  Sensitivity  to  the  fixation  constraint 

In order to experiment with the degradation of the filter enforcing the fixation constraint in
the presence of motions that violate the fixation assumptions, we have perturbed the experiments
described above by translating the cloud on a plane orthogonal to the fixation axis at random
within a standard deviation ranging from 1% to 6% of the norm of the essential matrix. We
have started from the true initial conditions and added no noise to the measurements. For
each level of disturbance, we have run 100 experiments, and computed the estimation error
for the translation along the fixation axis and for the rotation components. The results are
plotted in figure 12, where we show the average error across different trials, with the standard
deviation shown as an errorbar. The results seem to confirm that the degradation of the
estimates is graceful for small disturbances. However, when the disturbance exceeds 6% of
the overall norm of the current relative motion, the filter does not reach convergence.

7  Conclusions 

We have studied the problem of estimating the motion of a rigid object viewed from a
monocular perspective camera which is actuated so as to track one particular feature point in
the scene. We have cast the problem in the framework of epipolar geometry, and formulated
both closed-form and recursive schemes for estimating motion and attitude using
the fixation constraint. The framework of dynamic epipolar geometry allows us to compare
the proposed scheme directly against the equivalent scheme that does not enforce the fixation
constraint. Also, the degradation of the performance in the presence of disturbance in the
fixation hypothesis is assessed.

The performance of the estimators has been compared via simulations to the equivalent
estimation schemes that do not enforce the fixation constraint. The results seem to indicate
that using the fixation constraint helps achieve better accuracy in the presence of perfect
tracking. Degradation of the performance in the presence of disturbance in the fixation
constraint is graceful for small disturbances. It will be the subject of future research to study
how to compensate for non-perfect tracking by feeding back a measure of “goodness of
fixation” and performing a shift-registration of the origin of the image plane.

Figure 10: (top-left) Convergence of the filter to the singular configuration. For a noise level
of 1 pixel and the initial conditions far enough from the true values, the state of the filter
ends up in the minimum of the residual surface corresponding to cyclorotation (all states are
zero but $\Omega_3$, which is arbitrary). Correspondingly the innovation becomes zero (top-right)
and the variance increases (bottom plot). The variance is represented via the errorbars in
the motion estimates, which are the diagonal elements of the variance/covariance matrix of
the estimation error.

Figure 11: (top-left) Convergence of the filter enforcing the normalization constraint. There
are no singular configurations in the state manifold, and the filter converges fast to the
correct estimate. The innovation is small but non-zero (top-right), and the variance of the
state decreases as time grows (bottom).

Figure 12: Estimation error versus disturbances in the fixation constraint. The plots show the
average over 100 trials, with the standard deviation across trials shown as an errorbar. When
the fixation constraint is violated by adding spurious translation components ranging from
1 to 6 percent of the norm of the fixating motion, the estimation error increases gracefully.
The left plot shows the estimation error for the translation along the optical axis, the right plot
the norm of the estimation error for the rotational velocity.